1 Abstract
The muHVT package is a collection of R functions to facilitate building topology preserving maps for rich multivariate data. Tending towards a big data preponderance, a large number of rows. A collection of R functions for this typical workflow is organized below :
Data Compression: Vector quantization (VQ), HVQ (hierarchical vector quantization) using means or medians. This step compresses the rows (long data frame) using a compression objective
Data Projection: Dimension projection of the compressed cells to 1D,2D and 3D with the Sammons Non-linear Algorithm. This step creates topology preserving map coordinates into the desired output dimension. Additionally, embedding can be learned from the compressed data to further enhance the representation of the multivariate data. Embedding provide a lower-dimensional space where high-dimensional vectors, such as word vectors, can be translated, making it easier to perform machine learning tasks on large inputs.
Tessellation: Create cells required for object visualization using the Voronoi Tessellation method, package includes heatmap plots for hierarchical Voronoi tessellations (HVT). This step enables data insights, visualization, and interaction with the topology preserving map. Useful for semi-supervised tasks
Prediction: Scoring new data sets and recording their assignment using the map objects from the above steps, in a sequence of maps if required
2 Compress: Vector Quantization
This package can perform vector quantization using the following algorithms -
- Hierarchical Vector Quantization using \(k-means\)
- Hierarchical Vector Quantization using \(k-medoids\)
2.1 Hierarchical VQ
2.1.1 Using k-means
- The k-means algorithm randomly selects k data points as initial means
- k clusters are formed by assigning each data point to its closest cluster mean using the Euclidean distance
- Virtual means for each cluster are calculated by using all datapoints contained in a cluster
The second and third steps are iterated until a predefined number of iterations is reached or the clusters converge. The runtime for the algorithm is O(n).
2.1.2 Using k-medoids
- The k-medoids algorithm randomly selects k data points as initial means out of the n data points as the medoids.
- k clusters are formed by assigning each data point to its closest medoid by using any common distance metric methods.
- Virtual means for each cluster are calculated by using all datapoints contained in a cluster
The second and third steps are iterated until a predefined number of iterations is reached or the clusters converge. The runtime for the algorithm is O(k * (n-k)^2) .
2.1.3 Hierarchical Vector Quantization
The algorithm divides the dataset recursively into cells using \(k-means\) or \(k-medoids\) algorithm. The maximum number of subsets are decided by setting \(n_cells\) to, say five, in order to divide the dataset into maximum of five subsets. These five subsets are further divided into five subsets(or less), resulting in a total of twenty five (5*5) subsets. The recursion terminates when the cells either contain less than three data point or a stop criterion is reached. In this case, the stop criterion is set to when the cell error exceeds the quantization threshold.
The steps for this method are as follows :
- Select k(number of cells), depth and quantization error threshold
- Perform quantization (using \(k-means\) or \(k-medoids\)) on the input dataset
- Calculate quantization error for each of the k cells
- Compare the quantization error for each cell to quantization error threshold
- Repeat steps 2 to 4 for each of the k cells whose quantization error is above threshold until stop criterion is reached.
The stop criterion is when the quantization error of a cell satisfies one of the below conditions
- reaches below quantization error threshold
- there are less than three data points in the cell
- the user specified depth has been attained
The quantization error for a cell is defined as follows :
\[QE = \max_i(||A-F_i||_{p})\]
where
- \(A\) is the centroid of the cell
- \(F_i\) represents a data point in the cell
- \(m\) is the number of points in the cell
- \(p\) is the \(p\)-norm metric. Here \(p\) = 1 represents L1 Norm and \(p\) = 2 represents L2 Norm.
2.1.4 Quantization Error
Let us try to understand quantization error with an example.
Figure 1: The Voronoi tessellation for level 1 shown for the 5 cells with the points overlayed
An example of a 2 dimensional VQ is shown above.
In the above image, we can see 5 cells with each cell containing a certain number of points. The centroid for each cell is shown in blue. These centroids are also known as codewords since they represent all the points in that cell. The set of all codewords is called a codebook.
Now we want to calculate quantization error for each cell. For the
sake of simplicity, let’s consider only one cell having centroid
A and m data points \(F_i\) for calculating quantization
error.
For each point, we calculate the distance between the point and the centroid.
\[ d = ||A - F_i||_{p} \]
In the above equation, p = 1 means L1_Norm distance
whereas p = 2 means L2_Norm distance. In the package, the
L1_Norm distance is chosen by default. The user can pass
either L1_Norm, L2_Norm or a custom function
to calculate the distance between two points in n dimensions.
\[QE = \max_i(||A-F_i||_{p})\]
Now, we take the maximum calculated distance of all m points. This
gives us the furthest distance of a point in the cell from the centroid,
which we refer to as Quantization Error. If the
Quantization Error is higher than the given threshold, the centroid/
codevector is not a good representation for the points in the cell. Now
we can perform further Vector Quantization on these points and repeat
the above steps.
Please note that the user can select mean, max or any custom function
to calculate the Quantization Error. The custom function takes a vector
of m value (where each value is a distance between point in
n dimensions and centroids) and returns a single value
which is the Quantization Error for the cell.
If we select mean as the error metric, the above
Quantization Error equation will look like this :
\[QE = \frac{1}{m}\sum_{i=1}^m||A-F_i||_{p}\]
3 Project: Sammon’s projection
Sammon’s projection is an algorithm that maps a high-dimensional space to a space of lower dimensionality while attempting to preserve the structure of inter-point distances in the projection. It is particularly suited for use in exploratory data analysis and is usually considered a non-linear approach since the mapping cannot be represented as a linear combination of the original variables. The centroids are plotted in 2D after performing Sammon’s projection at every level of the tessellation.
Denoting the distance between \(i^{th}\) and \(j^{th}\) objects in the original space by \(d_{ij}^*\), and the distance between their projections by \(d_{ij}\). Sammon’s mapping aims to minimize the below error function, which is often referred to as Sammon’s stress or Sammon’s error
\[E=\frac{1}{\sum_{i<j} d_{ij}^*}\sum_{i<j}\frac{(d_{ij}^*-d_{ij})^2}{d_{ij}^*}\]
The minimization of this can be performed either by gradient descent, as proposed initially, or by other means, usually involving iterative methods. The number of iterations need to be experimentally determined and convergent solutions are not always guaranteed. Many implementations prefer to use the first Principal Components as a starting configuration.
4 Voronoi Tessellations
A Voronoi diagram is a way of dividing space into a number of regions. A set of points (called seeds, sites, or generators) is specified beforehand and for each seed, there will be a corresponding region consisting of all points within proximity of that seed. These regions are called Voronoi cells. It is complementary to Delaunay triangulation.
Tessellate: Constructing Voronoi Tesselations
In this package, we use sammons from the package
MASS to project higher dimensional data to a 2D space. The
function hvq called from the HVT function
returns hierarchical quantized data which will be the input for
construction of the tessellations. The data is then represented in 2D
coordinates and the tessellations are plotted using these coordinates as
centroids. We use the package deldir for this purpose. The
deldir package computes the Delaunay triangulation (and
hence the Dirichlet or Voronoi tessellation) of a planar point set
according to the second (iterative) algorithm of Lee and Schacter. For
subsequent levels, transformation is performed on the 2D coordinates to
get all the points within its parent tile. Tessellations are plotted
using these transformed points as centroids. The lines in the
tessellations are chopped in places so that they do not protrude outside
the parent polygon. This is done for all the subsequent levels.
4.1 Example Usage : Visualizing Multidimensional Data with Sammon’s Projection using Torus (Donut)
In this section, we will see how we can use the package to visualize multidimensional data by projecting them to two dimensions using Sammon’s projection.
First of all, let us see how to generate data for torus. We are using
a library geozoo for this purpose. Geo Zoo (stands for
Geometric Zoo) is a compilation of geometric objects ranging from three
to 10 dimensions. Geo Zoo contains regular or well-known objects, eg
cube and sphere, and some abstract objects, e.g. Boy’s surface, Torus
and Hyper-Torus.
Here, we will generate a 3D torus with 9000 points.
set.seed(240)
# Here p represents dimension of object
# n represents number of points
torus <- geozoo::torus(p = 3,n = 9000)
torus_df <- data.frame(torus$points)
colnames(torus_df) <- c("x","y","z")Now let’s do some EDA on the data. First of all, we will see how the data looks like
DT::datatable(head(torus_df), rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))Now let’s have a look at structure and summary of the data.
str(torus_df)
#> 'data.frame': 9000 obs. of 3 variables:
#> $ x: num -2.63 -1.42 -1.03 1.88 -1.95 ...
#> $ y: num 0.566 -0.89 1.107 0.189 -2.251 ...
#> $ z: num -0.725 0.945 -0.873 0.994 0.207 ...summary(torus_df)
#> x y z
#> Min. :-2.99767 Min. :-2.999343 Min. :-0.9999999
#> 1st Qu.:-1.15065 1st Qu.:-1.120632 1st Qu.:-0.7130951
#> Median :-0.01899 Median : 0.001856 Median : 0.0033675
#> Mean :-0.00914 Mean : 0.004195 Mean : 0.0001237
#> 3rd Qu.: 1.13001 3rd Qu.: 1.130708 3rd Qu.: 0.7138584
#> Max. : 2.99713 Max. : 2.999308 Max. : 1.0000000Now let’s try to visualize the object in a 3D Space.
scaling_factor <- 0.5
library(rgl)
plotids <- with(torus_df, plot3d(torus_df$x, torus_df$y, torus_df$z,
type="s", col=c("white","red"),xlab = "X", ylab = "Y", zlab = "Z",zlim = c(0, 15*scaling_factor)))
rglwidget(elementId = "plot3drgl")Now let’s try to use the package and project the above 3D object in a 2D Space. However, let’s try to comprehend the HVT function first before moving on.
HVT(
dataset,
min_compression_perc,
n_cells,
depth,
quant.err,
projection.scale,
normalize = T,
distance_metric = c("L1_Norm", "L2_Norm"),
error_metric = c("mean", "max"),
quant_method = c("kmeans", "kmedoids"),
diagnose = TRUE,
hvt_validation = FALSE,
train_validation_split_ratio = 0.8
)Each of the parameters of HVT function have been explained below :
dataset- A dataframe with numeric columnsmin_compression_perc- An integer indicating the minimum percent compression rate to be achieved for the datasetn_cells- An integer indicating the number of cells per hierarchy (layers)depth- An integer indicating the number of layers. (1 = No hierarchy, 2 = 2 layers, etc …)quant.error- A number indicating the quantization error threshold. A cell will only breakdown into further cells if the quantization error of the cell is above the defined quantization error thresholddistance_metric- The distance metric can beL1_NormorL2_Norm.L1_Normis selected by default. The distance metric is used to calculate the distance between anndimensional point and centroid. The user can also pass a custom function to calculate this distanceerror_metric- The error metric can bemeanormax.maxis selected by default.maxwill return the max ofmvalues andmeanwill take mean ofmvalues where each value is a distance between a point and centroid of the cell. Moreover, the user can also pass a custom function to calculate the error metricquant_method- The quantization method can bekmeansorkmedoids.kmeansis selected by defaultnormalize- A logical value indicating whether the columns in your dataset need to be normalized. Default value is TRUE. The algorithm supports Z-score normalizationdiagnose- A logical value indicating whether user wants to perform diagnostics on the model. Default value is TRUE.hvt_validation- A logical value indicating whether user wants to holdout a validation set and find mean absolute deviation of the validation points from the centroid. Default value is FALSE.train_validation_split_ratio- A numeric value indicating train validation split ratio. This argument is only used when hvt_validation has been set to TRUE. Default value for the argument is 0.8
First we will perform Hierarchical Vector Quantization using the below mentioned model parameters.
Model Parameters
- Number of Cells at each Level = 100
- Maximum Depth = 1
- Quantization Error Threshold = 0.1
- Error Metric = Max
- Distance Metric = Manhattan
set.seed(240)
hvt.torus <- muHVT::HVT(
torus_df,
n_cells = 100,
depth = 1,
quant.err = 0.1,
projection.scale = 10,
normalize = T,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans"
)Now let’s try to understand plotHVT function. The parameters have been explained in detail below
plotHVT(hvt.results, line.width, color.vec, pch1 = 21, centroid.size = 3, title = NULL, maxDepth = 1)hvt.results- A list containing the output of the HVT function which has the details of the tessellations to be plottedline.width- A vector indicating the line widths of the tessellation boundaries for each layercolor.vec- A vector indicating the colors of the tessellations boundaries at each layerpch1- Symbol type of the centroids of the tessellations (parent levels). Refer points (default = 21)centroid.size- Size of centroids of first level tessellations (default = 3)title- Set a title for the plot (default = NULL)
For better visualisation, let’s plot the Voronoi tessellation.
muHVT::plotHVT(
hvt.torus,
line.width = c(0.4),
color.vec = c("#141B41"),
centroid.size = 0.8,
maxDepth = 1
)Let’s checkout the compression summary .
compressionSummaryTable(hvt.torus[[3]]$compression_summary)| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 100 | 0 | 0 | n_cells: 100 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
As it can be seen from the table above, none of the 100 cells have hit the quantization threshold error
Now let’s try again with the below mentioned set of model parameters.
Model Parameters
- Number of Cells at each Level = 300
- Maximum Depth = 1
- Quantization Error Threshold = 0.1
- Error Metric = Max
- Distance Metric = Manhattan
set.seed(240)
hvt.torus2 <- muHVT::HVT(
torus_df,
n_cells = 300,
depth = 1,
quant.err = 0.1,
projection.scale = 10,
normalize = T,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans"
)For better visualisation, let’s plot the Voronoi tessellation
muHVT::plotHVT(
hvt.torus2,
line.width = c(0.4),
color.vec = c("#141B41"),
centroid.size = 0.8,
maxDepth = 1
)Let’s checkout the compression summary for torus.
compressionSummaryTable(hvt.torus2[[3]]$compression_summary)| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 300 | 43 | 0.14 | n_cells: 300 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
It can be observed from the table above that only 43 cells
out of 300 i.e. 14% of the cells hit the Quantization Error
threshold
Now let’s try again with the below mentioned set of model parameters.
Model Parameters
- Number of Cells at each Level = 900
- Maximum Depth = 1
- Quantization Error Threshold = 0.1
- Error Metric = Max
- Distance Metric = Manhattan
set.seed(240)
hvt.torus3 <- muHVT::HVT(
torus_df,
n_cells = 900,
depth = 1,
quant.err = 0.1,
projection.scale = 10,
normalize = T,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans"
)For better visualisation, let’s plot the Voronoi tessellation
muHVT::plotHVT(
hvt.torus3,
line.width = c(0.4),
color.vec = c("#141B41"),
centroid.size = 0.6,
maxDepth = 1
)Let’s check the compression summary for torus.
compressionSummaryTable(hvt.torus3[[3]]$compression_summary)| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 900 | 839 | 0.93 | n_cells: 900 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
By increasing the number of cells to 900, we were
successfully able to compress 93% of the data
We will now overlay all the features as heatmap over the Voronoi Tessellation plot for better visualization.
Let’s have look at the function hvtHmap which we will
use to overlay a variable as heatmap.
hvtHmap(hvt.results, dataset, child.level, hmap.cols, color.vec ,line.width, palette.color = 6)hvt.results- A list of results obtained from the HVT functiondataset- A dataframe containing the variables to overlay as a heatmap. The user can pass an external dataset or the dataset that was used to perform hierarchical vector quantization. The dataset should have the same number of points as the dataset used to perform hierarchical Vector Quantization in the HVT functionchild.level- A number indicating the level for which the heat map is to be plottedhmap.cols- The column number of column name from the dataset indicating the variables for which the heat map is to be plotted. To plot the quantization error as heatmap, pass'quant_error'. Similarly to plot the no of points in each cell as heatmap, pass'no_of_points'as a parametercolor.vec- A color vector such that length(color.vec) = child.level (default = NULL)line.width- A line width vector such that length(line.width) = child.level (default = NULL)palette.color- A number indicating the heat map color palette. 1 - rainbow, 2 - heat.colors, 3 - terrain.colors, 4 - topo.colors, 5 - cm.colors, 6 - BlCyGrYlRd (Blue,Cyan,Green,Yellow,Red) color (default = 6)show.points- A boolean indicating whether the centroids should be plotted on the tessellations (default = FALSE)
Now let’s plot the Voronoi Tessellation with the heatmap overlaid for all the features in the torus data for better visualization.
metric_list <- colnames(hvt.torus3[[3]]$summary)
metric_list <- metric_list[6:9]
hmap <- list()
hmap <- lapply(1:length(metric_list), function(x){
muHVT::hvtHmap(
hvt.torus3,
torus_df,
child.level = 1,
hmap.cols = metric_list[[x]],
line.width = c(0.4),
color.vec = c("#141B41"),
palette.color = 6,
centroid.size = 1,
show.points = T,
quant.error.hmap = 0.1,
n_cells.hmap = 900
)
})grid.arrange(hmap[[1]], nrow = 1, ncol=1)grid.arrange(hmap[[2]] ,nrow = 1, ncol=1)grid.arrange(hmap[[3]], nrow = 1, ncol=1)grid.arrange(hmap[[4]], nrow = 1, ncol=1)Now lets explore a little more and try to project the torus(3D) object to 2D Space with the below mentioned model parameters at level 2
Model Parameters
- Number of Cells at each Level = 20
- Maximum Depth = 2
- Quantization Error Threshold = 0.1
- Error Metric = Max
- Distance Metric = Manhattan
set.seed(240)
hvt.torus4 <- muHVT::HVT(
torus_df,
n_cells = 20,
depth = 2,
quant.err = 0.1,
projection.scale = 10,
normalize = T,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans"
)For better visualisation, let’s plot the Voronoi tessellation.
muHVT::plotHVT(
hvt.torus4,
line.width = c(0.6,0.4),
color.vec = c("#141B41","#0582CA"),
centroid.size = 0.8,
maxDepth = 2
)Let’s checkout the compression summary for torus.
compressionSummaryTable(hvt.torus4[[3]]$compression_summary)| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 20 | 0 | 0 | n_cells: 20 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
| 2 | 400 | 114 | 0.28 | n_cells: 20 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
It can be observed from the table above that only 114 cells
out of 400 i.e. 28% of the cells hit the Quantization Error
threshold
Now let’s try again with the below mentioned set of model parameters.
Model Parameters
- Number of Cells at each Level = 26
- Maximum Depth = 2
- Quantization Error Threshold = 0.1
- Error Metric = Max
- Distance Metric = Manhattan
set.seed(240)
hvt.torus5 <- muHVT::HVT(
torus_df,
n_cells = 26,
depth = 2,
quant.err = 0.1,
projection.scale = 10,
normalize = T,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans"
)For better visualisation, let’s plot the Voronoi tessellation.
muHVT::plotHVT(
hvt.torus5,
line.width = c(0.6,0.4),
color.vec = c("#141B41","#0582CA"),
centroid.size = 0.8,
maxDepth = 2
)Let’s checkout the compression summary for torus.
compressionSummaryTable(hvt.torus5[[3]]$compression_summary)| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 26 | 0 | 0 | n_cells: 26 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
| 2 | 666 | 535 | 0.8 | n_cells: 26 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
It can be observed from the table above that only 535 cells
out of 666 i.e. 80% of the cells hit the Quantization Error
threshold
Now let’s plot the Voronoi Tessellation with the heatmap overlaid for all the features in the torus data at level 2 for better visualization.
metric_list <- colnames(hvt.torus3[[3]]$summary)
metric_list <- metric_list[6:9]
hmap <- list()
hmap <- lapply(1:length(metric_list), function(x){
muHVT::hvtHmap(
hvt.torus5,
torus_df,
child.level = 2,
hmap.cols = metric_list[[x]],
line.width = c(0.6,0.4),
color.vec = c("#141B41","#0582CA"),
palette.color = 6,
centroid.size = 1,
show.points = T,
quant.error.hmap = 0.1,
n_cells.hmap = 26
)
})grid.arrange(hmap[[1]], nrow = 1, ncol=1)grid.arrange(hmap[[2]] ,nrow = 1, ncol=1)grid.arrange(hmap[[3]], nrow = 1, ncol=1)grid.arrange(hmap[[4]], nrow = 1, ncol=1)4.2 Example Usage of Hierarchical VQ with Sammons Projection and Tessellation on Personal Computers Dataset.
In this section, we will use the
Prices of Personal Computers dataset. This dataset contains
6259 observations and 10 features. The dataset observes the price from
1993 to 1995 of 486 personal computers in the US. The variables are
price, speed, ram, screen, cd, etc. The dataset can be downloaded from
here.
In this example, we will compress this dataset by using hierarchical VQ via k-means and visualize the Voronoi Tessellation plots using Sammons projection. Later on, we will overlay all the variables as a heatmap to generate further insights.
Here, we load the data and store into a variable
computers.
set.seed(240)
# Load data from csv files
computers <- read.csv("https://raw.githubusercontent.com/Mu-Sigma/muHVT/master/vignettes/sample_dataset/Computers.csv")Let’s explore the Personal Computers Dataset.
# Quick peek
DT::datatable(head(computers), rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))Now, let us check the structure of the data and analyse its summary.
str(computers)
#> 'data.frame': 6259 obs. of 11 variables:
#> $ X : int 1 2 3 4 5 6 7 8 9 10 ...
#> $ price : int 1499 1795 1595 1849 3295 3695 1720 1995 2225 2575 ...
#> $ speed : int 25 33 25 25 33 66 25 50 50 50 ...
#> $ hd : int 80 85 170 170 340 340 170 85 210 210 ...
#> $ ram : int 4 2 4 8 16 16 4 2 8 4 ...
#> $ screen : int 14 14 15 14 14 14 14 14 14 15 ...
#> $ cd : chr "no" "no" "no" "no" ...
#> $ multi : chr "no" "no" "no" "no" ...
#> $ premium: chr "yes" "yes" "yes" "no" ...
#> $ ads : int 94 94 94 94 94 94 94 94 94 94 ...
#> $ trend : int 1 1 1 1 1 1 1 1 1 1 ...summary(computers)
#> X price speed hd
#> Min. : 1 Min. : 949 Min. : 25.00 Min. : 80.0
#> 1st Qu.:1566 1st Qu.:1794 1st Qu.: 33.00 1st Qu.: 214.0
#> Median :3130 Median :2144 Median : 50.00 Median : 340.0
#> Mean :3130 Mean :2220 Mean : 52.01 Mean : 416.6
#> 3rd Qu.:4694 3rd Qu.:2595 3rd Qu.: 66.00 3rd Qu.: 528.0
#> Max. :6259 Max. :5399 Max. :100.00 Max. :2100.0
#> ram screen cd multi
#> Min. : 2.000 Min. :14.00 Length:6259 Length:6259
#> 1st Qu.: 4.000 1st Qu.:14.00 Class :character Class :character
#> Median : 8.000 Median :14.00 Mode :character Mode :character
#> Mean : 8.287 Mean :14.61
#> 3rd Qu.: 8.000 3rd Qu.:15.00
#> Max. :32.000 Max. :17.00
#> premium ads trend
#> Length:6259 Min. : 39.0 Min. : 1.00
#> Class :character 1st Qu.:162.5 1st Qu.:10.00
#> Mode :character Median :246.0 Median :16.00
#> Mean :221.3 Mean :15.93
#> 3rd Qu.:275.0 3rd Qu.:21.50
#> Max. :339.0 Max. :35.00Let us first split the data into train and test. We will use 80% of the data as train and remaining as test.
noOfPoints <- dim(computers)[1]
trainLength <- as.integer(noOfPoints * 0.8)
trainComputers <- computers[1:trainLength,]
testComputers <- computers[(trainLength+1):noOfPoints,]K-means is not suitable for factor variables as the sample space for factor variables is discrete. A Euclidean distance function on such a space isn’t really meaningful. Hence, we will delete the factor variables(X, cd, multi, premium, trend) in our dataset.
Here we keep the original trainComputers and
testComputers as we will use the variables from this
dataset to overlay as heatmap and generate some insights.
trainComputers <-
trainComputers %>% dplyr::select(-c(X, cd, multi, premium, trend))
testComputers <-
testComputers %>% dplyr::select(-c(X, cd, multi, premium, trend))Now, lets have a look at the scaled training dataset containing (5007 data points)
trainComputers <- scale(trainComputers)
metric_list <- colnames(trainComputers)
scale_attr <- attributes(trainComputers)
trainComputers <- trainComputers %>% as.data.frame()
DT::datatable(head(trainComputers), rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))Now, lets have a look at the scaled testing dataset containing (1252 data points).
testComputers <- scale(testComputers, center = scale_attr$`scaled:center`, scale = scale_attr$`scaled:scale`) %>% as.data.frame()
DT::datatable(head(testComputers), rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))Let us try to understand the HVT function first.
HVT(
dataset,
min_compression_perc,
n_cells,
depth,
quant.err,
projection.scale,
normalize = T,
distance_metric = c("L1_Norm", "L2_Norm"),
error_metric = c("mean", "max"),
quant_method = c("kmeans", "kmedoids"),
diagnose = TRUE,
hvt_validation = FALSE,
train_validation_split_ratio = 0.8
)Each of the parameters of HVT function have been explained below :
dataset- A dataframe with numeric columnsmin_compression_perc- An integer indicating the minimum percent compression rate to be achieved for the datasetn_cells- An integer indicating the number of cells per hierarchy (layers)depth- An integer indicating the number of layers. (1 = No hierarchy, 2 = 2 layers, etc …)quant.error- A number indicating the quantization error threshold. A cell will only breakdown into further cells if the quantization error of the cell is above the defined quantization error thresholddistance_metric- The distance metric can beL1_NormorL2_Norm.L1_Normis selected by default. The distance metric is used to calculate the distance between anndimensional point and centroid. The user can also pass a custom function to calculate this distanceerror_metric- The error metric can bemeanormax.maxis selected by default.maxwill return the max ofmvalues andmeanwill take mean ofmvalues where each value is a distance between a point and centroid of the cell. Moreover, the user can also pass a custom function to calculate the error metricquant_method- The quantization method can bekmeansorkmedoids.kmeansis selected by defaultnormalize- A logical value indicating whether the columns in your dataset need to be normalized. Default value is TRUE. The algorithm supports Z-score normalizationdiagnose- A logical value indicating whether user wants to perform diagnostics on the model. Default value is TRUE.hvt_validation- A logical value indicating whether user wants to holdout a validation set and find mean absolute deviation of the validation points from the centroid. Default value is FALSE.train_validation_split_ratio- A numeric value indicating train validation split ratio. This argument is only used when hvt_validation has been set to TRUE. Default value for the argument is 0.8
First we will perform Hierarchical Vector Quantization using the below mentioned model parameters to generate map A.
Model Parameters
- Number of Cells at each Level = 1001
- Maximum Depth = 1
- Quantization Error Threshold = 0.1
- Error Metric = Max
- Distance Metric = Manhattan
set.seed(240)
hvt.results <- list()
map_A <- muHVT::HVT(trainComputers,
n_cells = 1001,
depth = 1,
quant.err = 0.1,
projection.scale = 10,
normalize = F,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans",
diagnose = F)map_A[[3]] gives us detailed
information about the hierarchical vector quantized data.
map_A[[3]][['summary']] gives a nice
tabular data containing no of points, Quantization Error and the
codebook.
The datatable displayed below is the summary from map A
DT::datatable(map_A[[3]]$summary, rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))Now let us understand what each column in the above summary table means:
Segment.Level- Layers of the cell. In this case, we have performed Vector Quantization for depth 1. Hence Segment Level is 1Segment.Parent- Parent segment of the cellSegment.Child (Cell.Number)- The children of a particular cell. In this case, it is the total number of cells at which we achieved the defined compression percentagen- No of points in each cellCell.ID- Cell_ID’s are generated for the multivariate data using 1-D Sammon’s Projection algorithmQuant.Error- Quantization Error for each cell
All the columns after this will contain centroids for each cell. They can also be called a codebook, which represents a collection of all centroids or codewords.
Now let’s check the compression summary. The table below shows no of cells, no of cells having quantization error below threshold and percentage of cells having quantization error below threshold for each level.
compressionSummaryTable(map_A[[3]]$compression_summary)| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 1001 | 831 | 0.83 | n_cells: 1001 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
As it can be seen from the table above,
83% of the cells have hit the quantization
threshold error
Now let’s try to understand plotHVT function. The parameters have been explained in detail below
plotHVT(map_A, line.width, color.vec, pch1 = 21, centroid.size = 3, title = NULL, maxDepth = 1)map_A- A list containing the output of the HVT function which has the details of the tessellations to be plottedline.width- A vector indicating the line widths of the tessellation boundaries for each layercolor.vec- A vector indicating the colors of the tessellations boundaries at each layerpch1- Symbol type of the centroids of the tessellations (parent levels). Refer points (default = 21)centroid.size- Size of centroids of first level tessellations (default = 3)title- Set a title for the plot (default = NULL)
Let’s plot the Voronoi tessellation for layer 1 (map A)
# Voronoi tessellation plot for level one
#testPlot
muHVT::plotHVT(map_A,
line.width = c(0.6),
color.vec = c("#141B41"),
centroid.size = 1.5,
maxDepth = 1)Figure 3: The Voronoi Tessellation for layer 1 (map A) shown for the 1001 cells in the dataset ’computers’
We will now overlay all the variable as heatmap over the Voronoi Tessellation plot for better visualization.
Let’s have look at the function hvtHmap which we will
use to overlay a variable as heatmap.
hvtHmap(map_A, dataset, child.level, hmap.cols, color.vec ,line.width, palette.color = 6)map_A- A list of results obtained from the HVT functiondataset- A dataframe containing the variables to overlay as a heatmap. The user can pass an external dataset or the dataset that was used to perform hierarchical vector quantization. The dataset should have the same number of points as the dataset used to perform hierarchical Vector Quantization in the HVT functionchild.level- A number indicating the level for which the heat map is to be plottedhmap.cols- The column number of column name from the dataset indicating the variables for which the heat map is to be plotted. To plot the quantization error as heatmap, pass'quant_error'. Similarly to plot the no of points in each cell as heatmap, pass'no_of_points'as a parametercolor.vec- A color vector such that length(color.vec) = child.level (default = NULL)line.width- A line width vector such that length(line.width) = child.level (default = NULL)palette.color- A number indicating the heat map color palette. 1 - rainbow, 2 - heat.colors, 3 - terrain.colors, 4 - topo.colors, 5 - cm.colors, 6 - BlCyGrYlRd (Blue,Cyan,Green,Yellow,Red) color (default = 6)show.points- A boolean indicating whether the centroids should be plotted on the tessellations (default = FALSE)
Now let’s plot the Voronoi Tessellation with the heatmap overlaid for all the features in the computers dataset for better visualization.
metric_list <- colnames(trainComputers)
hmap <- list()
hmap <- lapply(1:length(metric_list), function(x){
muHVT::hvtHmap(
map_A,
trainComputers,
child.level = 1,
hmap.cols = metric_list[[x]],
line.width = c(0.2),
color.vec = c("#141B41"),
palette.color = 6,
centroid.size = 1.5,
show.points = T,
quant.error.hmap = 0.1,
n_cells.hmap = 1001
)
})grid.arrange(hmap[[1]], nrow = 1, ncol=1)grid.arrange(hmap[[2]], nrow = 1, ncol=1)grid.arrange(hmap[[3]], nrow = 1, ncol=1)grid.arrange(hmap[[4]], nrow = 1, ncol=1)grid.arrange(hmap[[5]], nrow = 1, ncol=1)grid.arrange(hmap[[6]], nrow = 1, ncol=1)In this section, we will manually figure out the novelty cells from the plotted map A and store it in identified_Novelty_cells variable.
The identified_Novelty_cells along with the map A is passed to removeNovelty() function.
The output of removeNovelty() function is a list of two items: a dataset with novelty records and the subset of the dataset without novelty records.
library(plyr)
identified_Novelty_cells <<- c(213,384)
output_list <- removeNovelty(identified_Novelty_cells, map_A)
dataset_with_novelty <- output_list[[1]]
dataset_without_novelty <- output_list[[2]]The datatable displayed below are the data with novelties.
colnames(dataset_with_novelty) <- c("Cell.ID","Segment.Child","price","speed","hd","ram","screen","ads")
DT::datatable(dataset_with_novelty, rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))The plotCells function is used to plot the Voronoi tessellation using the compressed HVT map (map A) and highlights the identified outlier cell(s) in red on the map.
Let’s look at the Voronoi tessellation with the novelty cell(s) in the map highlighted in red.
plotCells(identified_Novelty_cells, map_A)Figure 4: The Voronoi Tessellation with the novelty cells in the map highlighted in red
We pass the dataframe with novelty records to HVT function along with below mentioned model parameters to generate map B (layer 2).
Model Parameters
- Number of Cells at each Level = 3
- Maximum Depth = 1
- Quantization Error Threshold = 0.1
- Error Metric = Max
- Distance Metric = Manhattan
dataset_with_novelty <- dataset_with_novelty[,-1:-2]
map_B <- list()
map_B <- muHVT::HVT(dataset_with_novelty,
n_cells = 3,
depth = 1,
quant.err = 0.1,
projection.scale = 10,
normalize = F,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans",
diagnose = F)The datatable displayed below is the summary from map B
DT::datatable(map_B[[3]]$summary, rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))Now let’s check the compression summary. The table below shows no of cells, no of cells having quantization error below threshold and percentage of cells having quantization error below threshold for each level.
compressionSummaryTable(map_B[[3]]$compression_summary)| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 3 | 3 | 1 | n_cells: 3 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
It can be observed from the table above that 3 cells out of 3
i.e. 100% of the cells has hit the
Quantization Error threshold
With the Novelties removed, we construct another hierarchical Voronoi tessellation map C layer 2 on the dataset without Novelty and below mentioned model parameters.
Model Parameters
- Number of Cells at each Level = 1001
- Maximum Depth = 1
- Quantization Error Threshold = 0.1
- Error Metric = Max
- Distance Metric = Manhattan
map_C <- list()
map_C <- muHVT::HVT(dataset_without_novelty,
n_cells = 1001,
depth = 1,
quant.err = 0.1,
projection.scale = 10,
normalize = F,
distance_metric = "L1_Norm",
error_metric = "max",
quant_method = "kmeans",
diagnose = F)The datatable displayed below is the summary from map C
DT::datatable(map_C[[3]]$summary, rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))Now let’s check the compression summary. The table below shows no of cells, no of cells having quantization error below threshold and percentage of cells having quantization error below threshold for each level.
compressionSummaryTable(map_C[[3]]$compression_summary)| segmentLevel | noOfCells | noOfCellsBelowQuantizationError | percentOfCellsBelowQuantizationErrorThreshold | parameters |
|---|---|---|---|---|
| 1 | 1001 | 829 | 0.83 | n_cells: 1001 quant.err: 0.1 distance_metric: L1_Norm error_metric: max quant_method: kmeans |
It can be observed from the table above that 829 cells out of
1001 i.e. 83% of the cells has hit the
Quantization Error threshold
Now let’s plot the Voronoi Tessellation with the heatmap overlaid for all the features in the computers dataset for better visualization.
metric_list <- colnames(trainComputers)
hmap <- list()
hmap <- lapply(1:length(metric_list), function(x){
muHVT::hvtHmap(
map_C,
trainComputers,
child.level = 1,
hmap.cols = metric_list[[x]],
line.width = c(0.2),
color.vec = c("#141B41"),
palette.color = 6,
centroid.size = 1.5,
show.points = T,
quant.error.hmap = 0.1,
n_cells.hmap = 1001
)
})grid.arrange(hmap[[1]], nrow = 1, ncol=1)grid.arrange(hmap[[2]], nrow = 1, ncol=1)grid.arrange(hmap[[3]], nrow = 1, ncol=1)grid.arrange(hmap[[4]], nrow = 1, ncol=1)grid.arrange(hmap[[5]], nrow = 1, ncol=1)grid.arrange(hmap[[6]], nrow = 1, ncol=1)We now have the set of maps (map A, map B & map C) which will be used to predict which map and cell each test record is assigned to.
5 Predict
Now once we have built the model, let us try to predict using our test dataset which cell and which layer each point belongs to.
predictLayerHVT(data,
map_A,
map_B,
map_C,
mad.threshold = 0.2,
normalize = T,
distance_metric="L1_Norm",
error_metric="max",
child.level = 1,
line.width = c(0.6, 0.4, 0.2),
color.vec = c("#141B41", "#6369D1", "#D8D2E1"),
yVar= NULL,
...)Each of the parameters of predictLayerHVT function has been explained below:
data- A dataframe containing the test dataset. The dataframe should have atleast one variable used for training. The variables from this dataset can also be used to overlay as heatmapmap A- A list of hvt.results.model obtained from HVT function while performing hierarchical vector quantization on train datamap B- A list of hvt.results.model obtained from HVT function while performing hierarchical vector quantization on train data with noveltymap C- A list of hvt.results.model obtained from HVT function while performing hierarchical vector quantization on train data without noveltychild.level- A number indicating the layer for which the heat map is to be plotted.(Only used if hmap.cols is not NULL)mad.threshold- A numeric values indicating the permissible Mean Absolute Deviationnormalize- A logical value indicating if the columns in your dataset should be normalized. Default value is TRUE.distance_metric- The distance metric can be ’Euclidean” or “Manhattan”. Euclidean is selected by default.error_metric- The error metric can be “mean” or “max”. mean is selected by defaultyVar- Name of the dependent variable(s)...- color.vec and line.width can be passed from here
5.1 Prediction Algorithm
The prediction algorithm recursively calculates the distance between each point in the test dataset and the cell centroids for each level. The following steps explain the prediction method for a single point in the test dataset :
- Calculate the distance between the point and the centroid of all the cells in the first level
- Find the cell whose centroid has minimum distance to the point
- Check if the cell drills down further to form more cells
- If it doesn’t, return the path. Or else repeat steps 1 to 4 till we reach a level at which the cell doesn’t drill down further
Let’s see which cell and layer each point belongs to.
validation_data <- testComputers
new_predict <- predictLayerHVT(
data=validation_data,
map_A,
map_B,
map_C,
normalize = F
)
DT::datatable(new_predict, rownames = FALSE,
class = 'cell-border stripe',
options = list(scrollX = TRUE))Now, lets understand the output of
predictLayerHVT function.
Layer1.Cell.ID- Contains Cell.ids from map A depending on the number of cells provided as inputLayer2.Cell.ID- Contains Cell.ids from map B (novelty) and map C.
6 Executive Summary
Example Usage1: Visualizing Multidimensional Data with Sammon’s Projection using Torus (Donut)
We have considered torus dataset for multidimensional data visualization using sammons projection.
We constructed a compressed HVT map (hvt.torus) by applying the HVT() on the torus dataset. We set the parameters as follows:
n_cells = 100,quant.error = 0.1, anddepth = 1. Upon analyzing the compression summary, we found that none of the 100 cells exceeded the quantization threshold error.We created another compressed HVT map (hvt.torus2) using the HVT() algorithm on the torus dataset. This time, we adjusted the parameters to
n_cells = 300,quant.error = 0.1, anddepth = 1. After examining the compression summary, we discovered that 14% of the cells hit the quantization threshold error.Once again, we generated a compressed HVT map (hvt.torus3) using the HVT() algorithm on the torus dataset. The parameters for this map were set to
n_cells = 900,quant.error = 0.1, anddepth = 1. Upon analyzing the compression summary, we found that 93% of the 100 cells hit the quantization threshold error and we can clearly visualize the 3D torus(donut) in 2D space.Example Usage2: Hierarchical VQ with Sammons Projection and Tessellation on Personal Computers Dataset
We have considered computers dataset for creating a predictive set of maps to monitor entities over time using predictLayerHVT() in this notebook
We construct a compressed HVT map (Map A) using the HVT() on the training dataset by setting
n_cellsto 1001 andquant.errorto 0.1Based on the output of the above step, we manually identify the novelty cell(s) from the plotted map A. For this dataset, we identify the 213th and 384th cells as the novelty cell.
We pass the identified novelty cell(s) as a parameter to the removeNovelty() along with HVT map A. The function removes that novelty cell(s) from the dataset and stores them separately. It also returns the dataset without novelty(s).
The plotCells() constructs hierarchical voronoi tessellations and highlights the identified novelty cell(s) in red
The dataset with novelty is then passed to the HVT() to construct another HVT map (map B). But here, we set the parameters
n_cells= 3,depth= 1 etc. when constructing the mapThe dataset without novelties is then passed to the HVT() to construct another HVT map (map C). But here, we set the parameters
n_cells= 1001,depth= 1 etc. when constructing the map.Finally, the set of maps - map A, map B, and map C are passed to the predictLayerHVT() along with the test dataset to predict which map and what cell each test record is assigned to.
The output of predictLayerHVT is a dataset with two columns Layer1.Cell.ID and Layer2.Cell.ID. Layer1.Cell.ID contains cell ids from map A in the form A1,A2,A3…. and Layer2.Cell.ID contains cell ids from map B as B1,B2… depending on the identified novelties and map C as C1,C2,C3…..
7 Applications
Pricing Segmentation - The package can be used to discover groups of similar customers based on the customer spend pattern and understand price sensitivity of customers
Market Segmentation - The package can be helpful in market segmentation where we have to identify micro and macro segments. The method used in this package can do both kinds of segmentation in one go
Anomaly Detection - This method can help us categorize system behavior over time and help us find anomaly when there are changes in the system. For e.g. Finding fraudulent claims in healthcare insurance
The package can help us understand the underlying structure of the data. Suppose we want to analyze a curved surface such as sphere or vase, we can approximate it by a lot of small low-order polygons in the form of tessellations using this package
In biology, Voronoi diagrams are used to model a number of different biological structures, including cells and bone microarchitecture
Using the base idea of Systems Dynamics, these diagrams can also be used to depict customer state changes over a period of time
8 References
Topology Preserving Maps : https://link.springer.com/chapter/10.1007/1-84628-118-0_7
Vector Quantization : https://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-450-principles-of-digital-communications-i-fall-2006/lecture-notes/book_3.pdf
Sammon’s Projection : http://en.wikipedia.org/wiki/Sammon_mapping
Voronoi Tessellations : http://en.wikipedia.org/wiki/Centroidal_Voronoi_tessellation
Embedding : https://en.wikipedia.org/wiki/Word_embedding